
Azure Doc Intelligence 0.2 - support paragraphs and tables for multiple models #10431

Closed
wants to merge 5 commits into from

Conversation


@annjawn annjawn commented Sep 10, 2023

This PR introduces enhancements to the Azure Document Intelligence document loader.

  • Uses paragraphs to build full-page text, improving efficiency with fewer iterations. Paragraphs are supported by all models.
  • Supports paragraphs via a split_mode parameter at initialization of DocumentIntelligenceLoader. This defaults to page, in which case the full text of each page is returned. If paragraph is used, Documents are returned chunked by paragraph. Paragraph chunks may be useful for generating embeddings on smaller pieces of text instead of having to split the full page text yet again.
  • Provides table data extraction if the model specified is prebuilt-document, prebuilt-layout, or prebuilt-invoice. This is useful for developers who intend to use tables with Self-query.
  • Introduces a type key in Document metadata to help distinguish page text vs. paragraphs vs. tables, with the values PAGE, PARAGRAPH, TABLE_HEADER, and TABLE_ROW.
  • For tables, provides the headers and rows in CSV format along with the table index, while retaining the page number; this can be used to load a vector DB for self-query. Note: metadata formatting against the Document schema for self-query is still needed, which can be done with the help of the type key (TABLE_HEADER and TABLE_ROW), table_index, and page.

Sample usage

from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient
from langchain.document_loaders.pdf import DocumentIntelligenceLoader

document_analysis_client = DocumentAnalysisClient(endpoint="<endpoint>", credential=AzureKeyCredential("<key>"))

loader = DocumentIntelligenceLoader("./document.pdf",
    client=document_analysis_client,
    model="prebuilt-document",
    split_mode="paragraph"          # optional, defaults to `page`
) 
documents = loader.load()

tables = []
for doc in documents:
  if doc.metadata['type'] in ['PAGE', 'PARAGRAPH']:
    # page text
    print(f"====Page {doc.metadata['page']} {doc.metadata['type']}-text====\n\n")
    print(doc)
    print("\n\n")
  elif doc.metadata['type'] in ['TABLE_HEADER', 'TABLE_ROW']:
    tables.append(doc)

# first table in the document
table1 = [d for d in tables if d.metadata['table_index'] == 0]
# second table in the document
table2 = [d for d in tables if d.metadata['table_index'] == 1]
# third table in the document
table3 = [d for d in tables if d.metadata['table_index'] == 2]
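The TABLE_HEADER and TABLE_ROW documents described above can be reassembled into CSV text for indexing. A minimal sketch, assuming each document's page_content is one CSV line and metadata carries table_index as in the PR description; plain dicts (with hypothetical sample values) stand in here for the Document objects that loader.load() would return:

```python
# Hypothetical stand-ins for the Document objects yielded by the loader;
# real content would come from loader.load().
tables = [
    {"page_content": "Item,Qty,Price",
     "metadata": {"type": "TABLE_HEADER", "table_index": 0, "page": 1}},
    {"page_content": "Widget,2,9.99",
     "metadata": {"type": "TABLE_ROW", "table_index": 0, "page": 1}},
    {"page_content": "Gadget,1,4.50",
     "metadata": {"type": "TABLE_ROW", "table_index": 0, "page": 1}},
]

def table_to_csv(docs, table_index):
    """Join the header and row documents of one table into CSV text."""
    lines = [d["page_content"] for d in docs
             if d["metadata"]["table_index"] == table_index]
    return "\n".join(lines)

print(table_to_csv(tables, 0))
# Item,Qty,Price
# Widget,2,9.99
# Gadget,1,4.50
```

The same table_index filter shown in the sample usage above is all that is needed; the header line comes first because the loader yields TABLE_HEADER before the TABLE_ROW documents.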


@dosubot dosubot bot added Ɑ: doc loader Related to document loader module (not documentation) 🤖:improvement Medium size change to existing code to handle new use-cases labels Sep 10, 2023
@LarsAC
Contributor

LarsAC commented Sep 11, 2023

Great addition, @annjawn. I just made an update for DocumentIntelligence that also rolls up paragraphs under the same SectionHeading, to generate larger documents that are semantically coherent per the structure of the document. Also, from previous experiments it seems that tables and paragraphs from the DI API overlap, so I used the spans to come up with an ordered list of non-overlapping text chunks. Would you mind having a look at https://github.com/LarsAC/langchain/tree/larsac/azure-di?

@annjawn
Author

annjawn commented Sep 11, 2023

Hey @LarsAC, yes, technically LINES/WORDS/PARAGRAPHS will indeed overlap with TABLE. The idea behind including TABLE is to provide a way for people to use it in Self-query. However, we may still want to include it in the text as well; it's probably a matter of looking at type = PAGE | PARAGRAPH if the user is interested in plain text only, or type = TABLE_HEADER & TABLE_ROW if the user is interested in the tables. I also just added PAGE-level and PARAGRAPH-level chunking, selected via split_mode based on which the user prefers.

I will definitely take a look at your updates 👍, though I am thinking we should still provide some flexibility in how the user may want to retrieve the text from the doc.

@LarsAC
Contributor

LarsAC commented Sep 13, 2023

@annjawn Fully agree with the flexibility. I had also added a "switch" parameter to the constructor of the loader in order to let the user control how to parse the text. We could likely add more options in parallel.


def __init__(self, client: Any, model: str):
def __init__(self, client: Any, model: str, split_mode: str):
Collaborator

@baskaryan baskaryan Sep 14, 2023


could we give this a default val, probably "page"? so this isn't a breaking change and default behavior doesn't change too much


Collaborator

but can we have default here as well, in case this object is instantiated directly by a user?

Author

Yes, we can default to "page" here as well, @baskaryan
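The non-breaking signature being discussed might look like the following sketch. Parameter names follow the PR; the class name and validation are illustrative only, not the actual implementation:

```python
class DocumentIntelligenceParser:
    """Illustrative stand-in for the parser class under review."""

    def __init__(self, client, model: str, split_mode: str = "page"):
        # Defaulting split_mode to "page" preserves the existing behavior
        # for callers that don't pass the new parameter.
        if split_mode not in ("page", "paragraph"):
            raise ValueError(
                f"split_mode must be 'page' or 'paragraph', got {split_mode!r}"
            )
        self.client = client
        self.model = model
        self.split_mode = split_mode

# Existing callers that omit split_mode keep the old page-level behavior:
parser = DocumentIntelligenceParser(client=None, model="prebuilt-document")
print(parser.split_mode)  # page
```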


def _generate_docs(self, blob: Blob, result: Any) -> Iterator[Document]:
for p in result.pages:
Collaborator

if split mode is page should we just keep existing logic? is there value in parsing by paragraph and re-assembling pages?

Author

@annjawn annjawn Sep 14, 2023


@baskaryan the idea of providing paragraphs as an option is to do chunking (splitting) as supported by Azure AI's layout capabilities, rather than having to chunk again using, say, a text splitter. This is helpful for generating embeddings of chunks (paragraphs) that retain the semantic consistency of the text. We won't reassemble the paragraphs back into pages if paragraph is used; rather, we keep them the way Doc Intelligence's layout extracts them. If the user specifies page explicitly, or just doesn't pass the parameter at initialization, then page is the default and the entire page text is generated per page. Hope this makes sense.

Collaborator

what i mean is why not do something like

if self.split_mode == "page":
    for p in result.pages:
        ...
elif self.split_mode == "paragraph":
    for p in result.paragraphs:
        ...

to save us having to write logic for reassembling paragraphs into pages in the case that split mode is page

Author

@annjawn annjawn Sep 16, 2023


@baskaryan right, I am actually doing this here. The result object doesn't have each page's full text individually in the pages attribute, as it may seem; we actually construct pages by concatenating paragraphs. The highest-level grouping that Doc Intelligence goes up to is the entire document (all text from all pages concatenated into one), and below that it is per-page paragraphs (then lines, then words). The content attribute of result is the combined text of all pages, so it's easier to assemble each page from paragraphs than to try to split content into individual pages. That assembly (of paragraphs) only happens if self.split_mode == "page". Here's the structure for a better explanation.

[Screenshot: structure of the Document Intelligence result object]
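The page assembly described above can be sketched without the Azure SDK. Here SimpleNamespace objects with hypothetical content and page_number attributes stand in for result.paragraphs (in the real SDK, a paragraph's page number sits on its bounding regions rather than directly on the paragraph):

```python
from collections import defaultdict
from types import SimpleNamespace

# Hypothetical stand-ins for result.paragraphs.
paragraphs = [
    SimpleNamespace(content="Intro text.", page_number=1),
    SimpleNamespace(content="More on page one.", page_number=1),
    SimpleNamespace(content="Second page starts.", page_number=2),
]

def assemble_pages(paragraphs):
    """Concatenate paragraphs into per-page full text (split_mode='page')."""
    pages = defaultdict(list)
    for p in paragraphs:
        pages[p.page_number].append(p.content)
    # Join each page's paragraphs in document order.
    return {num: " ".join(parts) for num, parts in sorted(pages.items())}

print(assemble_pages(paragraphs))
```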

Author

Attaching a sample JSON output from a 2-page document extracted via the prebuilt-read model.

output.json.zip

file_path: str,
client: Any,
model: str = "prebuilt-document",
split_mode: str = "page",
Author

@annjawn annjawn Sep 14, 2023


@baskaryan here's where it's defaulted to page, so it won't introduce any breaking change.

"type": "PAGE",
},
)
yield d
Author

@baskaryan here's the page vs. paragraph logic. If page is used, we collate paragraphs into each individual page's full text and set "type": "PAGE" in the Document metadata. If paragraph is used, we keep the paragraphs as-is and simply yield each one as a Document with "type": "PARAGRAPH".
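That yield branch might look like the following sketch; plain dicts with hypothetical keys stand in for both the paragraph objects and the Document schema:

```python
def generate_docs(paragraphs, split_mode="page"):
    """Yield one document per page, or one document per paragraph."""
    if split_mode == "paragraph":
        # Keep Doc Intelligence's paragraph chunking as-is.
        for p in paragraphs:
            yield {"page_content": p["content"],
                   "metadata": {"page": p["page"], "type": "PARAGRAPH"}}
    else:
        # "page": collate paragraphs into each page's full text.
        pages = {}
        for p in paragraphs:
            pages.setdefault(p["page"], []).append(p["content"])
        for page, parts in sorted(pages.items()):
            yield {"page_content": " ".join(parts),
                   "metadata": {"page": page, "type": "PAGE"}}

paras = [{"content": "A.", "page": 1}, {"content": "B.", "page": 1}]
print(list(generate_docs(paras)))               # one PAGE document
print(list(generate_docs(paras, "paragraph")))  # two PARAGRAPH documents
```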

@zifeiq
Contributor

zifeiq commented Nov 20, 2023

Hi @annjawn, may I ask what the plan for this PR is? Is it going to be updated and merged?

@hwchase17 hwchase17 closed this Jan 30, 2024
@baskaryan baskaryan reopened this Jan 30, 2024
@baskaryan
Collaborator

Apologies for the slow review! The PR has some merge conflicts; happy to re-review if you'd like to resolve them.

@ccurme ccurme added community Related to langchain-community langchain Related to the langchain package labels Jun 18, 2024
@hwchase17 hwchase17 closed this Jul 8, 2024